In complement to our IGV genome browser course where we reviewed visualising genomics data in a browser, here we will use R/Bioconductor to produce publication quality graphics programatically.
In this session we will be dealing with a range of data types. For more information on file types you can revisit our material.
For more information on visualising genomics data in browsers you can visit our IGV course.
We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below.
Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived VisualisingGenomicsData folder in the Rstudio menu
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/PathToMyDownload/VisualizingGenomicsData/viz_course/presentations/Slides")
# e.g. setwd("~/Downloads/VisualizingGenomicsData/viz_course/presentations/Slides") We have already discussed on using the IGV browser to review our data and get access to online data repositories.
IGV is quick, user friendly GUI to perform the essential task of review genomics data in its context.
For more information see our course on IGV here.
The Gviz packages offers methods to produce publication quality plots of genomics data at genomic features of interest.
To get started using Gviz in some biological examples, first we need to install the package.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("Gviz") Now we have created a GenomeAxisTrack track object we can display the object using plotTracks function.
In order to display a axis track we need to set the limits of the plot (otherwise where would it start and end?).
plotTracks(genomeAxis,from=100,to=10100) We can also configure the resolution of the axis (albeit rather bluntly) using the littleTicks parameter.
This will add additional axis tick marks between those shown by default.
plotTracks(genomeAxis,from=100,to=10100,
littleTicks = TRUE) By default the plot labels for the genome axis track are alternating below and above the line.
We can further configure the axis labels using the labelPos parameter.
Here we set the labelPos to be always below the axis
plotTracks(genomeAxis,from=100,to=10100,
labelPos="below") In the previous plots we have produced a genomic axis which allows us to consider the position of the features within the linear genome.
In some contexts we may be more interested in relative distances around and between the genomic features being displayed.
We can configure the axis track to give us a relative representative of distance using the scale parameter.
plotTracks(genomeAxis,from=100,to=10100,
scale=1,labelPos="below") We may want to add only a part of the scale (such as with Google Maps) to allow the reviewer to get a sense of distance.
We can specify how much of the total axis we wish to display as a scale using a value of 0 to 1 representing the proportion of scale to show.
plotTracks(genomeAxis,from=100,to=10100,
scale=0.3) We can also provide numbers greater than 1 to the scale parameter which will determine, in absolute base pairs, the size of scale to display.
plotTracks(genomeAxis,from=100,to=10100,
scale=2500) We can add “regions of interest” to the axis plotted by Gviz as we have done with IGV.
To do this we will need to define some ranges to signify the positions of “regions of interest” in the linear context of our genome track.
Since the plots have no apparent context for chromosomes (yet), we will use a IRanges object to specify “regions of interest” as opposed to the genome focused GRanges.
To create an IRanges object we will load the IRanges library and specify vectors of start and end parameters to the IRanges constructor function.
library(IRanges)
regionsOfInterest <- IRanges(start=c(140,5140),end=c(2540,7540))
names(regionsOfInterest) <- c("ROI_1","ROI_2")
regionsOfInterest ## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## ROI_1 140 2540 2401
## ROI_2 5140 7540 2401
We include the names specified in the IRanges for the regions of interest within the axis plot by specify the showID parameter to TRUE.
plotTracks(genomeAxis,from=100,to=10100,
range=regionsOfInterest,
showId=T) Lets update our IRanges object to have some score columns in the metadata columns. We can do this with the mcols function as shown in our Bioconductor material.
mcols(regionsOfInterest) <- data.frame(Sample1=c(30,20),
Sample2=c(20,200))
regionsOfInterest <- GRanges(seqnames="chr5",
ranges = regionsOfInterest)
regionsOfInterest ## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | Sample1 Sample2
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## ROI_1 chr5 [ 140, 2540] * | 30 20
## ROI_2 chr5 [5140, 7540] * | 20 200
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
As we have seen, DataTrack objects make use of IRanges/GRanges which are the central workhorse of Bioconductors HTS tools.
This means we can take advantage of the many manipulations available in the Bioconductor tool set.
Lets make use of rtracklayer’s importing tools to retrieve coverage from a bigWig as a GRanges object
library(rtracklayer)
allChromosomeCoverage <- import.bw("../../Data/small_Sorted_SRR568129.bw",
as="GRanges")
class(allChromosomeCoverage)## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
Now we have our coverage as a GRanges object we can create our DataTrack object from this.
Notice we specify the chromsome of interest in the chromosome parameter.
accDT <- DataTrack(allChromosomeCoverage,chomosome="chr5")
accDT ## DataTrack 'DataTrack'
## | genome: NA
## | active chromosome: chrM
## | positions: 1
## | samples:1
## | strand: *
## There are 248 additional annotation features on 24 further chromosomes
## chr1: 1
## chr10: 1
## chr11: 1
## chr12: 1
## chr13: 1
## ...
## chr7: 1
## chr8: 1
## chr9: 1
## chrX: 1
## chrY: 1
## Call seqlevels(obj) to list all available chromosomes or seqinfo(obj) for more detailed output
## Call chromosome(obj) <- 'chrId' to change the active chromosome
We can adjust the type of plots we want using the type argument. Here as with standard plotting we can specify “l” to get a line plot.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="l") Histograms by specifying “h”.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="h") Or filled/smoothed plots using “mountain”.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="mountain") As with all plotting functions in R, Gviz plots are highly customisable.
Simple features such as point size and colour are easily set as for standard R plots using cex and col paramters.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",
col="red",cex=4) The order of tracks in the plot is simply defined by the order they are placed in the vector passed to plotTracks()
plotTracks(c(genomeAxis,accDT),
from=134887451,to=134888111,
chromosome="chr5"
) By default, Gviz will try and provide sensible track heights for your plots to best display your data.
The track height can be controlled by providing a vector of relative heights to the sizes parameter of the plotTracks() function.
Genomic annotation, such as Gene/Transcript models, play an important part of visualising genomics data in context.
Gviz provides many routes for constructing genomic annotation using the AnnotationTrack() constructor function.
toGroup <- GRanges(seqnames="chr5",
IRanges(
start=c(10,500,550,2000,2500),
end=c(300,800,850,2300,2800)
))
names(toGroup) <- seq(1,5)
toGroup ## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr5 [ 10, 300] *
## 2 chr5 [ 500, 800] *
## 3 chr5 [ 550, 850] *
## 4 chr5 [2000, 2300] *
## 5 chr5 [2500, 2800] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can see the features are displayed grouped by lines.
But if we want to see the names we must specify the group parameter by using the groupAnnotation argument.
plotTracks(annoT,groupAnnotation = "group") When we created the GRanges used here we did not specify any strand information.
strand(toGroup) ## factor-Rle of length 5 with 1 run
## Lengths: 5
## Values : *
## Levels(3): + - *
Now we can specify some strand information for the GRanges and replot.
Arrows now indicate the strand which the features are on.
strand(toGroup) <- c("+","+","*","-","-")
annoT <- AnnotationTrack(toGroup,
group = c("Ann1","Ann1",
"Ann2",
"Ann3","Ann3"))
plotTracks(annoT, groupingAnnotation="group") In the IGV course we saw how you could control the display density of certain tracks.
Annotation tracks are often stored in files such as the general feature format (see our previous course).
IGV allows us to control the density of these tracks in the view options by setting to “collapsed”, “expanded” or “squished”.
Whereas “squished” and “expanded” maintains much of the information within the tracks, “collapsed” flattens overlapping features into a single displayed feature.
By setting the stacking parameter to “dense”, all overlapping features have been collapsed/flattened
plotTracks(annoT, groupingAnnotation="group",stacking="dense") We can set our own feature types for the AnnotationTrack object using the same feature() function.
We can choose any feature types we wish to define.
feature(annoT) <- c(rep("Good",4),rep("Bad",2))
feature(annoT) ## [1] "Good" "Good" "Good" "Good" "Bad" "Bad"
We have seen how we can display complex annotation using the AnnotationTrack objects.
For gene models Gviz contains a more specialised object, the GeneRegionTrack object.
The GeneRegionTrack object contains additional parameters and display options specific for the display of gene models.
Lets start by looking at the small gene model set stored in the Gviz package.
data(geneModels)
geneModels[1,]## chromosome start end width strand feature gene
## 1 chr7 26591441 26591829 389 + lincRNA ENSG00000233760
## exon transcript symbol
## 1 ENSE00001693369 ENST00000420912 AC004947.2
We can define a GeneRegionTrack as we would all other tracktypes. Here we provide a genome name, chromosome of interest and a name for the track.
grtrack <- GeneRegionTrack(geneModels, genome = "hg19",
chromosome = "chr7",
name = "smallRegions")
plotTracks(grtrack) plotTracks(grtrack) We can see that features here are rendered slightly differently to those in an AnnotationTrack object.
Here direction is illustrated by arrows in introns and unstranslated regions are shown as narrower boxes.
Similarly we can label transcripts by their individual transcript names.
plotTracks(grtrack,transcriptAnnotation="transcript") Or we can label using the transcriptAnnotation object by any arbitary column where there is one level per transcript.
plotTracks(grtrack,transcriptAnnotation="symbol") As with transcripts we can label individual features using the exonAnnotation parameter by any arbitary column where there is one level per feature/exon.
plotTracks(grtrack,exonAnnotation="exon",
from=26677490,to=26686889,cex=0.5) We saw that we can control the display density when plotting AnnotationTrack objects.
We can control the display density of GeneRegionTracks in the same manner.
plotTracks(grtrack, stacking="dense") However, since the GeneRegionTrack object is a special class of the AnnotationTrack object we have special parameter for dealing with display density of transcripts.
The collapseTranscripts parameter allows us a finer degree of control than that seen with stacking parameter. Here we set collapseTranscripts to be TRUE inorder to merge all overlapping transcripts.
plotTracks(grtrack, collapseTranscripts=T,
transcriptAnnotation = "symbol") Collapsing using the collapseTranscripts has summarised our transcripts into their respective gene boundaries.
We have however lost information on the strand of transcripts. To retain this information we need to specify a new shape for our plots using the shape parameter. To capture direction we use the “arrow” shape
plotTracks(grtrack, collapseTranscripts=T,
transcriptAnnotation = "symbol",
shape="arrow") The collapseTranscripts function also allows us some additional options by which to collapse our transcripts.
These methods maintain the intron information in the gene model and so get us closer to reproducing the “collapsed” feature in IGV.
Here we may collapse transcripts to the “longest”.
plotTracks(grtrack, collapseTranscripts="longest",
transcriptAnnotation = "symbol") We have seen in previous material how gene models are organised in Bioconductor using the TxDB objects.
Gviz may be used in junction with TxDB objects to construct the GeneRegionTrack objects.
We saw in the Bioconductor and ChIPseq course that many genomes have pre-build gene annotation within the respective TxDB libraries. Here we will load a TxDb for hg19 from the TxDb.Hsapiens.UCSC.hg19.knownGene library.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb ## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
With our new GeneRegionTrack we can now reproduce the gene models using the Bioconductor TxDb annotation.
Here the annotation is different but transcripts overlapping uc003syc are our SKAP2 gene.
plotTracks(customFromTxDb,
from=26591341,to=27034958,
transcriptAnnotation="gene") Now by combining the ability to create our own TxDb objects from GFFs we can create a very custom GeneRegionTrack from a GFF file. Here i use the UCSC table browser to extract a GTF of interest.
library(GenomicFeatures)
ensembleGTF <- "~/Downloads/hg19.gtf.gz"
txdbFromGFF <- makeTxDbFromGFF(file = ensembleGTF)
customFromTxDb <- GeneRegionTrack(txdbFromGFF,chromosome="chr7")
plotTracks(customFromTxDb,
from=26591341,to=27034958,
transcriptAnnotation="gene") Time for solutions! Link here